Self-Attention Temporal Convolutional Network for Long-Term Daily Living Activity Detection
In this paper, we address the detection of daily living activities in long-term untrimmed videos. This task is challenging due to the activities' long temporal extent, low inter-class variation and high intra-class variation. To tackle these challenges, recent approaches based on Temporal Convolutional Networks (TCNs) have been proposed. Such methods can capture long-term temporal patterns using a hierarchy of temporal convolutional filters, pooling and upsampling steps. However, like other convolutional networks, TCNs process only a local neighborhood across time, which makes them inefficient at modeling the long-range dependencies between the temporal patterns of a video. We therefore propose the Self-Attention Temporal Convolutional Network (SA-TCN), which captures both complex activity patterns and their dependencies within long-term untrimmed videos. We evaluate our model on the DAily Home LIfe Activity (DAHLIA) and Breakfast datasets, and achieve state-of-the-art performance on both.
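The two-stage idea in the abstract — a convolutional stage that captures local temporal patterns, followed by a self-attention stage that links patterns across the whole video — can be illustrated with a minimal numpy sketch. All weights here are random placeholders and the shapes (`T` frames, `D` feature channels) are assumptions for illustration; this is not the authors' SA-TCN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_conv(x, w):
    """'Same'-padded 1D convolution over time: x is (T, D), w is (k, D, D)."""
    T, D = x.shape
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, D))
    for t in range(T):
        for i in range(k):
            out[t] += xp[t + i] @ w[i]   # mixes only a local window of frames
    return out

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention across all time steps: x is (T, D)."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ V                               # every frame attends to every frame

T, D, k = 8, 4, 3
x = rng.normal(size=(T, D))
w = rng.normal(size=(k, D, D)) * 0.1
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

local = temporal_conv(x, w)                      # TCN stage: local temporal patterns
global_ctx = self_attention(local, Wq, Wk, Wv)   # attention stage: long-range dependencies
print(global_ctx.shape)
```

The sketch makes the abstract's point concrete: the convolution touches only `k` neighboring frames per output, while the attention matrix is `T x T`, so dependencies between distant temporal patterns are modeled in a single step.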
Selective Spatio-Temporal Aggregation Based Pose Refinement System: Towards Understanding Human Activities in Real-World Videos
Taking advantage of human pose data for understanding human activities has attracted much attention recently. However, state-of-the-art pose estimators struggle to obtain high-quality 2D or 3D pose data due to occlusion, truncation and low resolution in real-world unannotated videos. Hence, in this work we propose 1) a Selective Spatio-Temporal Aggregation mechanism, named SST-A, that refines and smooths the keypoint locations extracted by multiple expert pose estimators, and 2) an effective weakly-supervised self-training framework which leverages the aggregated poses as pseudo ground-truth, instead of handcrafted annotations, for real-world pose estimation. Extensive experiments evaluate not only the upstream pose refinement but also the downstream action recognition performance on four datasets: Toyota Smarthome, NTU-RGB+D, Charades, and Kinetics-50. We demonstrate that the skeleton data refined by our Pose-Refinement System (SSTA-PRS) is effective at boosting various existing action recognition models, achieving competitive or state-of-the-art performance.
Comment: WACV202
PDAN: Pyramid Dilated Attention Network for Action Detection
Handling long and complex temporal information is an important challenge for action detection. This challenge is further aggravated by densely distributed actions in untrimmed videos. Previous action detection methods fail to select the key temporal information in long videos. To this end, we introduce the Dilated Attention Layer (DAL). Compared to a standard temporal convolution layer, DAL allocates attentional weights to the local frames inside its kernel, which enables it to learn a better local representation across time. Building on DAL, we introduce the Pyramid Dilated Attention Network (PDAN). With multiple DALs at different dilation rates, PDAN models short-term and long-term temporal relations simultaneously by focusing on local segments at both low and high temporal receptive fields. This property enables PDAN to handle the complex temporal relations between different action instances in long untrimmed videos. To corroborate the effectiveness and robustness of our method, we evaluate it on three densely annotated, multi-label datasets: MultiTHUMOS, Charades, and the Toyota Smarthome Untrimmed (TSU) dataset. PDAN outperforms previous state-of-the-art methods on all of these datasets.
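The core contrast the abstract draws — a convolution mixes its kernel frames with fixed weights, while a DAL weights them by attention — can be sketched as follows. The scoring scheme (a dot product per kernel position) and the shared weights across pyramid levels are assumptions for illustration; the actual DAL/PDAN design is in the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dilated_attention_layer(x, w_score, dilation):
    """For each time step, attend over k neighbors spaced `dilation` frames apart.
    x: (T, D) frame features; w_score: (k, D) scoring weights, one per kernel slot."""
    T, D = x.shape
    k = w_score.shape[0]
    half = k // 2
    out = np.zeros_like(x)
    for t in range(T):
        idx = np.clip([t + (i - half) * dilation for i in range(k)], 0, T - 1)
        neigh = x[idx]                             # (k, D) the dilated local frames
        logits = (neigh * w_score).sum(axis=1)     # one score per kernel position
        out[t] = softmax(logits) @ neigh           # attention-weighted local mix
    return out

T, D, k = 10, 4, 3
rng = np.random.default_rng(2)
x = rng.normal(size=(T, D))
w = rng.normal(size=(k, D))
# Pyramid: stacking layers with growing dilation widens the temporal receptive field.
y = x
for d in (1, 2, 4):
    y = dilated_attention_layer(y, w, d)
print(y.shape)
```

Because the attention weights depend on the frames themselves, each layer can emphasize the most informative frames in its window rather than mixing them with fixed convolution weights.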
Toyota Smarthome: Real-World Activities of Daily Living
The performance of deep neural networks is strongly influenced by the quantity and quality of annotated data. Most large activity recognition datasets consist of data sourced from the web, which does not reflect the challenges that arise in activities of daily living. In this paper, we introduce a large real-world video dataset for activities of daily living: Toyota Smarthome. The dataset consists of 16K RGB+D clips of 31 activity classes, performed by seniors in a smarthome. Unlike previous datasets, the videos are fully unscripted. As a result, the dataset poses several challenges: high intra-class variation, high class imbalance, simple and composite activities, and activities with similar motion and variable duration. Activities are annotated with both coarse and fine-grained labels. These characteristics differentiate Toyota Smarthome from other activity recognition datasets. As recent activity recognition approaches fail to address the challenges posed by Toyota Smarthome, we present a novel activity recognition method with an attention mechanism: a pose-driven spatio-temporal attention mechanism through 3D ConvNets. We show that our method outperforms state-of-the-art methods on benchmark datasets as well as on the Toyota Smarthome dataset. We release the dataset for research use.
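The pose-driven attention idea mentioned at the end of the abstract can be reduced to a minimal sketch: pose features produce per-frame attention scores that reweight the video features. The single scoring vector `W` and the temporal-only (rather than spatio-temporal, 3D-ConvNet-based) attention are simplifications assumed for illustration, not the paper's method.

```python
import numpy as np

def pose_driven_attention(video_feat, pose_feat, W):
    """Weight per-frame video features by attention scores computed from pose.
    video_feat: (T, D); pose_feat: (T, P); W: (P,) hypothetical scoring weights."""
    logits = pose_feat @ W                  # (T,) one relevance score per frame
    logits = logits - logits.max()          # numerical stability
    attn = np.exp(logits) / np.exp(logits).sum()
    # Pose decides which frames dominate the pooled clip descriptor.
    return (attn[:, None] * video_feat).sum(axis=0)

T, D, P = 6, 8, 4                           # frames, video channels, pose channels (assumed)
rng = np.random.default_rng(3)
clip = pose_driven_attention(rng.normal(size=(T, D)),
                             rng.normal(size=(T, P)),
                             rng.normal(size=(P,)))
print(clip.shape)
```

The point of such a design is that the body configuration, not the raw appearance, decides which parts of the clip the classifier should focus on — useful when activities share similar motion, as in this dataset.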